The lecture notes provide code to plot the QDA boundary between regions 2 and 3 of the olive oils data, using linoleic and arachidic acid. Use the same split of training and testing sets as used in lecture notes.
The dot plot of the overall data, without standardization.

Below are the plots of the olive test and olive train data after standardization.
Under the QDA rule, the discriminant function for class $k$ is
$$\delta_k(x) = \log \pi_k - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k),$$
and a new observation $x$ is allocated to the class with the largest $\delta_k(x)$. Substituting the prior probabilities, group means, and covariance estimates from the output below gives the fitted discriminant function for region 2, and likewise for region 3.
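As an illustration, the allocation rule can be checked numerically. The sketch below (Python, not the report's R code) evaluates $\delta_k(x)$ using the group means, priors, and the covariance matrix printed in the output below; re-using one covariance matrix for both groups is a simplifying assumption for illustration only, since QDA proper estimates a separate $\Sigma_k$ per class.

```python
import math

# Evaluate the discriminant score delta_k(x) for two classes with the
# group means, priors and (shared, for illustration) 2x2 covariance taken
# from the qda() output below.  Variables are (arachidic, linoleic).
MU = {2: (0.7483237, 1.0746306), 3: (-0.4893105, -0.6861894)}
PRIOR = {2: 0.3952096, 3: 0.6047904}
SIGMA = ((0.178646453, -0.001300898),
         (-0.001300898, 0.169523093))

def qda_score(x, k):
    (a, b), (c, d) = SIGMA
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))      # 2x2 inverse
    dx = (x[0] - MU[k][0], x[1] - MU[k][1])
    maha = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.log(PRIOR[k]) - 0.5 * math.log(det) - 0.5 * maha

# A point at the region-2 mean is allocated to region 2, and vice versa.
print(qda_score(MU[2], 2) > qda_score(MU[2], 3))  # True
print(qda_score(MU[3], 3) > qda_score(MU[3], 2))  # True
```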
The LDA rule assumes an equal variance-covariance matrix across the classes; in this question the variances and covariances are unequal, so that rule is not suitable here. Moreover, QDA requires the observations in each class to follow a normal distribution with a class mean and class covariance. For this question, the 2 variables follow a multimodal rather than a normal distribution, so the olive data cannot satisfy this assumption either.
# Call:
# qda(region ~ ., data = olive_train)
#
# Prior probabilities of groups:
# 2 3
# 0.3952096 0.6047904
#
# Group means:
# arachidic linoleic
# 2 0.7483237 1.0746306
# 3 -0.4893105 -0.6861894
# arachidic linoleic
# arachidic 0.178646453 -0.001300898
# linoleic -0.001300898 0.169523093
(i) linear discriminant analysis,
# parsnip model object
#
# Fit time: 21ms
# Call:
# lda(region ~ ., data = data)
#
# Prior probabilities of groups:
# 2 3
# 0.3952096 0.6047904
#
# Group means:
# arachidic linoleic
# 2 0.7483237 1.0746306
# 3 -0.4893105 -0.6861894
#
# Coefficients of linear discriminants:
# LD1
# arachidic -0.9182956
# linoleic -2.0375057
The confusion matrix of the LDA predictions on the training data:

| .pred_class | 2 | 3 |
|---|---|---|
| 2 | 66 | 2 |
| 3 | 0 | 99 |

The confusion matrix on the test data:

| .pred_class | 2 | 3 |
|---|---|---|
| 2 | 32 | 0 |
| 3 | 0 | 50 |
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 1 |
| kap | binary | 1 |
The plot below shows the boundary under the LDA prediction method.

Here we plot the boundary on the training data, because the training set has more observations than the test set, so the boundary estimated from it is also more precise than one estimated from the test data.
(ii)* classification tree (using minsplit of 10) using the rpart engine.
Here we use the ‘rpart’ engine to construct the tree. From the following output, we can see there are 167 observations at the root, separated into 2 classes. Class 2 contains 66 observations, selected by the criterion linoleic >= 0.5366322; class 3 contains the remaining 101 observations, with linoleic < 0.5366322.
# n= 167
#
# node), split, n, loss, yval, (yprob)
# * denotes terminal node
#
# 1) root 167 66 3 (0.3952096 0.6047904)
# 2) linoleic>=0.5366322 66 0 2 (1.0000000 0.0000000) *
# 3) linoleic< 0.5366322 101 0 3 (0.0000000 1.0000000) *
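The fitted tree reduces to a single rule, which can be written out as a tiny classifier (a Python sketch of the rule in the output above, not the report's R code; the cutpoint is on the standardized linoleic scale):

```python
# Single split from the rpart output: predict region 2 when standardized
# linoleic is at or above the cutpoint, region 3 otherwise.
CUT = 0.5366322

def classify_region(linoleic):
    return 2 if linoleic >= CUT else 3

print(classify_region(1.0746306))   # region-2 group mean -> 2
print(classify_region(-0.6861894))  # region-3 group mean -> 3
```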
In the graph below, the green line is the boundary produced by the classification tree (rpart engine), and it separates the two classes well.
Below are the test-set predictions of the 3 methods:
# truth predict_lda predict_tree predict_qda
# 33 3 3 3 3
# 34 3 3 3 3
# 35 3 3 3 3
# 36 3 3 3 3
# 37 3 3 3 3
# 38 3 3 3 3
# 39 3 3 3 3
# 40 3 3 3 3
# 41 3 3 3 3
# 42 3 3 3 3
# 43 3 3 3 3
# 44 3 3 3 3
# 45 3 3 3 3
# 46 3 3 3 3
# 47 3 3 3 3
# 48 3 3 3 3
# 49 3 3 3 3
# 50 3 3 3 3
# 51 3 3 3 3
# 52 3 3 3 3
# 53 3 3 3 3
# 54 3 3 3 3
# 55 3 3 3 3
# 56 3 3 3 3
# 57 3 3 3 3
# 58 3 3 3 3
# 59 3 3 3 3
# 60 3 3 3 3
# 61 3 3 3 3
# 62 3 3 3 3
# 63 3 3 3 3
# 64 3 3 3 3
# 65 3 3 3 3
# 66 3 3 3 3
# 67 3 3 3 3
# 68 3 3 3 3
# 69 3 3 3 3
# 70 3 3 3 3
# 71 3 3 3 3
# 72 3 3 3 3
# 73 3 3 3 3
# 74 3 3 3 3
# 75 3 3 3 3
# 76 3 3 3 3
# 77 3 3 3 3
# 78 3 3 3 3
# 79 3 3 3 3
# 80 3 3 3 3
# 81 3 3 3 3
# 82 3 3 3 3
Here we can show that there are 32 such rows in total:
# [1] 32
Then we calculate the balanced accuracy as follows:
# [1] 1
# [1] 1
# [1] 1
All of them are equally accurate: the balanced accuracy of each model equals 1 (100%), which means all 3 models separate the two classes perfectly on the test data.
oleic (so you now have three predictors). Write a paragraph discussing how the models change, and why this might be.

Here, first of all, we take a look at the confusion matrices of all 3 models.
Here we refit the LDA model with oleic added, as follows:
# Call:
# lda(region ~ arachidic + linoleic, data = olive_train1)
#
# Prior probabilities of groups:
# 2 3
# 0.3952096 0.6047904
#
# Group means:
# arachidic linoleic
# 2 0.7483237 1.0746306
# 3 -0.4893105 -0.6861894
#
# Coefficients of linear discriminants:
# LD1
# arachidic -0.9182956
# linoleic -2.0375057
We also refit the QDA model with the variable oleic:
# Call:
# qda(region ~ arachidic + linoleic, data = olive_train1)
#
# Prior probabilities of groups:
# 2 3
# 0.3952096 0.6047904
#
# Group means:
# arachidic linoleic
# 2 0.7483237 1.0746306
# 3 -0.4893105 -0.6861894
We refit the tree model after adding oleic:
# n= 167
#
# node), split, n, loss, yval, (yprob)
# * denotes terminal node
#
# 1) root 167 66 3 (0.3952096 0.6047904)
# 2) linoleic>=0.5366322 66 0 2 (1.0000000 0.0000000) *
# 3) linoleic< 0.5366322 101 0 3 (0.0000000 1.0000000) *
The confusion matrix for LDA:
# Truth
# Prediction 2 3
# 2 32 0
# 3 0 50
| .metric | .estimator | .estimate |
|---|---|---|
| bal_accuracy | binary | 1 |
The confusion matrix for QDA:
# Truth
# Prediction 2 3
# 2 31 0
# 3 1 50
| .metric | .estimator | .estimate |
|---|---|---|
| bal_accuracy | binary | 0.984375 |
Comparing the 3 models by balanced accuracy after adding the extra variable oleic: the accuracy of the LDA and tree models is the same as without oleic, meaning both models still classify this test data exactly. However, the balanced accuracy of QDA changes from 1 to 0.984375. Although this is still high, adding the extra variable decreased the balanced accuracy of the QDA model by 1 - 0.984375 = 0.015625.
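The balanced-accuracy values can be reproduced from the confusion matrices alone. A Python sketch (not the report's yardstick code), using the QDA and LDA test confusion matrices shown above:

```python
# Balanced accuracy = average of per-class recalls, from a confusion matrix
# stored as {true_class: {predicted_class: count}}.
def balanced_accuracy(cm):
    recalls = [preds[truth] / sum(preds.values()) for truth, preds in cm.items()]
    return sum(recalls) / len(recalls)

qda_cm = {2: {2: 31, 3: 1}, 3: {2: 0, 3: 50}}   # QDA with oleic added
lda_cm = {2: {2: 32, 3: 0}, 3: {2: 0, 3: 50}}   # LDA (unchanged)

print(balanced_accuracy(qda_cm))  # 0.984375
print(balanced_accuracy(lda_cm))  # 1.0
```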
The palmerpenguins is a new R data package, with interesting measurements on penguins of three different species. Subset the data to contain just the Adelie and Gentoo species, and only the variables species and the four physical size measurement variables.
According to the scatter plots, we should select variables for which the two groups are clearly distinguished, while within each group the data show a strong correlation with each other.
# # A tibble: 6 x 5
# species bl bd fl bm
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 Adelie -0.693 0.926 -1.41 -0.680
# 2 Adelie -0.616 0.280 -1.08 -0.620
# 3 Adelie -0.462 0.578 -0.477 -1.28
# 4 Adelie -1.16 1.22 -0.610 -1.04
# 5 Adelie -0.655 1.87 -0.809 -0.799
# 6 Adelie -0.732 0.479 -1.41 -0.829
# [1] 274 5
# Levene's Test for Homogeneity of Variance (center = median)
# Df F value Pr(>F)
# group 1 1.6441 0.2009
# 272
# Levene's Test for Homogeneity of Variance (center = median)
# Df F value Pr(>F)
# group 1 3.5213 0.06166 .
# 272
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Levene's Test for Homogeneity of Variance (center = median)
# Df F value Pr(>F)
# group 1 0.157 0.6922
# 272
# Levene's Test for Homogeneity of Variance (center = median)
# Df F value Pr(>F)
# group 1 1.993 0.1592
# 272
Here, by observing the quantiles, we can see that bl and fl separate the 2 species well.

Then we take a more detailed look by applying a “featurePlot” analysis to the quantitative variables across the 2 species.

We clearly observe that the medians and interquartile ranges of the 2 penguin species are quite separated, though by different features in each species. By separation, we mean that the median and interquartile range of one class are distinct and do not overlap with the other class.

Here the variables bl and fl are estimated to separate the data well. From the scatter plots, the data points are well separated in any combination that contains bl, which means bill length may differ significantly between the 2 species. Moreover, from the box plot, the overlap between the 2 species on bl is also small, so the 2 classes are more separated under the bl criterion. For fl, the box plot shows the lowest overlap among all 4 variables, meaning there is a large difference between the species on this variable. In the scatter plots fl is not quite as good as bl, but since only very few points are mixed, we can still tell the 2 classes apart.
It is fair to assume the variance-covariance of each group is homogeneous. Homogeneous variance-covariance means that each group has the same shape: the variance can differ across projections, but it is the same for each group. According to the plots, the two groups (the Adelie-only data and the Gentoo-only data) have essentially the same shape in every projection shown, which implies they have common or similar variances.
# Call:
# lda(species ~ ., data = penguin_train, prior = c(0.5, 0.5))
#
# Prior probabilities of groups:
# Adelie Gentoo
# 0.5 0.5
#
# Group means:
# bl bd fl bm
# Adelie -0.7748322 0.7029637 -0.8448656 -0.7544736
# Gentoo 0.9019430 -0.9331506 0.9727739 0.8715777
#
# Coefficients of linear discriminants:
# LD1
# bl 0.6885892
# bd -1.9509187
# fl 1.4069513
# bm 0.9037988
The confusion matrix:
# Confusion Matrix and Statistics
#
# Reference
# Prediction Adelie Gentoo
# Adelie 50 0
# Gentoo 0 41
#
# Accuracy : 1
# 95% CI : (0.9603, 1)
# No Information Rate : 0.5495
# P-Value [Acc > NIR] : < 2.2e-16
#
# Kappa : 1
#
# Mcnemar's Test P-Value : NA
#
# Sensitivity : 1.0000
# Specificity : 1.0000
# Pos Pred Value : 1.0000
# Neg Pred Value : 1.0000
# Prevalence : 0.5495
# Detection Rate : 0.5495
# Detection Prevalence : 0.5495
# Balanced Accuracy : 1.0000
#
# 'Positive' Class : Adelie
#
Here the model appears to be accurate: the overall accuracy is 1 and kappa is 1 on the test data.

Computing the model error directly, we can see that every observation in the current test set is matched correctly:
# [1] 0
According to the LDA rule, a new observation $x_0$ is allocated to class 1 if
$$(\bar{x}_1-\bar{x}_2)^{\top} S_p^{-1} x_0 > \frac{1}{2}(\bar{x}_1-\bar{x}_2)^{\top} S_p^{-1} (\bar{x}_1+\bar{x}_2)$$
(with equal priors). In LDA, classes 1 and 2 need to be mapped to the data: the class lying to the right (larger values) on the reduced dimension is class 1 in the equation. So we compare the LD1 group means of the Adelie and Gentoo penguins:
# LD1
# Adelie -3.775543
# Gentoo 4.597946
# # A tibble: 4 x 2
# variables `Different mean between group -`
# <chr> <dbl>
# 1 BL -1.68
# 2 BD 1.64
# 3 FL -1.82
# 4 BM -1.63
# # A tibble: 4 x 2
# variables `Different mean between group +`
# <chr> <dbl>
# 1 BL 0.127
# 2 BD -0.230
# 3 FL 0.128
# 4 BM 0.117
(We use the matlib package to compute matrix inverses; see its help.) We select the group means:
# bl bd fl bm
# Adelie -0.7748322 0.7029637 -0.8448656 -0.7544736
# Gentoo 0.9019430 -0.9331506 0.9727739 0.8715777
Here we calculate the variance for each group
The variance of group Adelie
# bl bd fl bm
# bl 0.22524992 0.12158571 0.06008633 0.12070965
# bd 0.12158571 0.37552815 0.07290338 0.18188810
# fl 0.06008633 0.07290338 0.17334046 0.09076054
# bm 0.12070965 0.18188810 0.09076054 0.25946180
The variance of group Gentoo
# bl bd fl bm
# bl 0.3423178 0.1845119 0.1516292 0.2441749
# bd 0.1845119 0.2589256 0.1462688 0.2292669
# fl 0.1516292 0.1462688 0.1613951 0.1782303
# bm 0.2441749 0.2292669 0.1782303 0.3879481
Then we take the group sizes, $n_1 = 50$ for Adelie and $n_2 = 41$ for Gentoo, and substitute them into the pooled variance-covariance formula
$$S_p = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2}.$$
Multiplying the inverse of the pooled matrix by the difference in group means gives the discriminant vector:
# [,1]
# bl -5.765894
# bd 16.335996
# fl -11.781091
# bm -7.567949
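The pooled-covariance formula combines the two group matrices entry by entry. A Python sketch (not the report's R code) with only the first (‘bl’) row of each group matrix from the output above and the group sizes $n_1 = 50$, $n_2 = 41$:

```python
# Pooled variance-covariance: S_p = ((n1-1)*S1 + (n2-1)*S2) / (n1+n2-2),
# applied entry-wise; only the 'bl' row of each group matrix is typed in.
n1, n2 = 50, 41
adelie_bl = [0.22524992, 0.12158571, 0.06008633, 0.12070965]
gentoo_bl = [0.3423178, 0.1845119, 0.1516292, 0.2441749]

pooled_bl = [((n1 - 1) * a + (n2 - 1) * g) / (n1 + n2 - 2)
             for a, g in zip(adelie_bl, gentoo_bl)]
print(round(pooled_bl[0], 5))  # first pooled entry, approx 0.27786
```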
Now we make a plot of the discriminant space of the training set.
# species bl bd fl
# Adelie:50 Min. :-2.04076 Min. :-1.65696 Min. :-1.73970
# Gentoo:41 1st Qu.:-0.78007 1st Qu.:-0.91195 1st Qu.:-0.80934
# Median :-0.01981 Median : 0.13106 Median :-0.07834
# Mean : 0.04724 Mean : 0.06065 Mean : 0.06114
# 3rd Qu.: 0.76933 3rd Qu.: 0.87606 3rd Qu.: 0.95170
# Max. : 2.54007 Max. : 2.16740 Max. : 1.84884
# bm LD1
# Min. :-1.75620 Min. :-6.720
# 1st Qu.:-0.85900 1st Qu.:-4.298
# Median : 0.06811 Median :-2.680
# Mean : 0.05201 Mean :-0.364
# 3rd Qu.: 0.81578 3rd Qu.: 4.261
# Max. : 1.95223 Max. : 6.300
Here, using a density plot, we can also examine the separation. Since the two density curves do not intersect, the linear discriminant achieves a good separation of the 2 classes.
# # A tibble: 6 x 5
# species bl bd fl bm
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 Adelie -0.883 0.784 -1.42 -0.563
# 2 Adelie -0.810 0.126 -1.06 -0.501
# 3 Adelie -0.663 0.430 -0.421 -1.19
# 4 Adelie -1.32 1.09 -0.563 -0.937
# 5 Adelie -0.847 1.75 -0.776 -0.688
# 6 Adelie -0.920 0.329 -1.42 -0.719
# [1] 342 5
**Box plots and density plots can clearly show the relationship between the four variables and easily tell which variables are important for distinguishing the species.**

Tour graph with 4 different dimensions.
# parsnip model object
#
# Fit time: 20ms
# Call:
# lda(species ~ ., data = data, prior = ~c(1/3, 1/3, 1/3))
#
# Prior probabilities of groups:
# Adelie Chinstrap Gentoo
# 0.3333333 0.3333333 0.3333333
#
# Group means:
# bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
# Adelie 38.85545 18.40099 189.6238 3729.455
# Chinstrap 48.46957 18.34783 195.6522 3688.043
# Gentoo 47.67073 14.96707 217.1220 5049.085
#
# Coefficients of linear discriminants:
# LD1 LD2
# bill_length_mm 0.037062318 -0.413284602
# bill_depth_mm -1.020639811 0.092714992
# flipper_length_mm 0.092626372 -0.002792050
# body_mass_g 0.001515876 0.001782931
#
# Proportion of trace:
# LD1 LD2
# 0.835 0.165
Here we can conclude that, after taking Chinstrap into consideration, bd makes it easy to separate Gentoo from Adelie and Chinstrap; the value of bm can also separate Gentoo from Adelie and Chinstrap.

Furthermore, in this situation we should focus on the tour to choose which variables are useful for separating Chinstrap from Adelie. In the tour above, the green points are Adelie, whereas the orange points are Chinstrap. From the tour we can tell that bl is the most important criterion: the green and orange groups mostly separate in projections dominated by bl. Based on this, we can say bl retains the important information for distinguishing Chinstrap.
Impurity metrics other than Gini or entropy can be used. This metric, proposed by Buja and Lee (2001), is called a one-sided extreme:
We are going to use this metric to build a spam filter for the spam data used in tutorial 4.
spam=no. Explain why.

According to the rule of thumb, we put the more important of the 2 classes into class 1. Since the task asks us to filter out spam, ‘spam=no’ (legitimate mail) is the more important class, so we put it into class 1.
After removing the spampct column:
# # A tibble: 1,998 x 20
# isuid id `day of week` `time of day` size.kb box domain local digits
# <dbl> <dbl> <fct> <dbl> <dbl> <chr> <chr> <chr> <dbl>
# 1 1 1 Thu 0 7 no com no 0
# 2 1 2 Thu 0 2 no com no 0
# 3 1 3 Thu 14 3 no edu yes 0
# 4 1 9 Thu 6 3 yes edu yes 0
# 5 1 11 Thu 7 3 no com no 0
# 6 1 13 Thu 8 12 yes com no 0
# 7 1 14 Thu 8 12 yes com no 0
# 8 1 16 Thu 9 2 yes edu yes 0
# 9 1 19 Thu 10 2 no edu yes 0
# 10 1 23 Thu 12 3 no edu yes 0
# # ... with 1,988 more rows, and 11 more variables: name <chr>, cappct <dbl>,
# # special <dbl>, credit <chr>, sucker <chr>, porn <chr>, chain <chr>,
# # username <chr>, large text <chr>, category <chr>, spam <chr>
All the possible splits are enumerated as follows. The left and right buckets each contain 15 splits (rows); each left-bucket row combines with the corresponding right-bucket row to partition the original 5 domain levels.
# [,1] [,2]
# [1,] 1 2
# [2,] 1 3
# [3,] 1 4
# [4,] 1 5
# [5,] 1 2
# [6,] 2 3
# [7,] 2 4
# [8,] 2 5
# [9,] 1 3
# [10,] 2 3
# [11,] 3 4
# [12,] 3 5
# [13,] 1 4
# [14,] 2 4
# [15,] 3 4
# [16,] 4 5
# [17,] 1 5
# [18,] 2 5
# [19,] 3 5
# [20,] 4 5
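The bucket enumeration can be sketched directly (a Python sketch, not the report's R code): every left bucket of size 1 or 2 drawn from the 5 domain levels, with the right bucket as its complement, gives the 5 + 10 = 15 splits listed below.

```python
from itertools import combinations

levels = ["com", "edu", "net", "org", "gov"]

# Left buckets of size 1 or 2; sizes 3 and 4 are the mirror-image splits,
# so they add nothing new.  Each right bucket is the complement of the left.
splits = [(list(left), [l for l in levels if l not in left])
          for size in (1, 2)
          for left in combinations(levels, size)]

print(len(splits))  # 15
print(splits[1])    # (['edu'], ['com', 'net', 'org', 'gov'])
```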
# [1] 807
# [1] 1998
The right buckets will be:
# [[1]]
# [1] "edu" "net" "org" "gov"
#
# [[2]]
# [1] "com" "net" "org" "gov"
#
# [[3]]
# [1] "com" "edu" "org" "gov"
#
# [[4]]
# [1] "com" "edu" "net" "gov"
#
# [[5]]
# [1] "com" "edu" "net" "org"
#
# [[6]]
# [1] "net" "org" "gov"
#
# [[7]]
# [1] "edu" "org" "gov"
#
# [[8]]
# [1] "edu" "net" "gov"
#
# [[9]]
# [1] "edu" "net" "org"
#
# [[10]]
# [1] "com" "org" "gov"
#
# [[11]]
# [1] "com" "net" "gov"
#
# [[12]]
# [1] "com" "net" "org"
#
# [[13]]
# [1] "com" "edu" "gov"
#
# [[14]]
# [1] "com" "edu" "org"
#
# [[15]]
# [1] "com" "edu" "net"
The left buckets will be:
# [[1]]
# [1] "com"
#
# [[2]]
# [1] "edu"
#
# [[3]]
# [1] "net"
#
# [[4]]
# [1] "org"
#
# [[5]]
# [1] "gov"
#
# [[6]]
# [1] "com" "edu"
#
# [[7]]
# [1] "com" "net"
#
# [[8]]
# [1] "com" "org"
#
# [[9]]
# [1] "com" "gov"
#
# [[10]]
# [1] "edu" "net"
#
# [[11]]
# [1] "edu" "org"
#
# [[12]]
# [1] "edu" "gov"
#
# [[13]]
# [1] "net" "org"
#
# [[14]]
# [1] "net" "gov"
#
# [[15]]
# [1] "org" "gov"
The OSE computed for each candidate split is then:
# # A tibble: 15 x 2
# index ose
# <dbl> <dbl>
# 1 1 0.694
# 2 2 0.00868
# 3 3 0.729
# 4 4 0.154
# 5 5 0.1
# 6 6 0.309
# 7 7 0.698
# 8 8 0.677
# 9 9 0.687
# 10 10 0.0823
# 11 11 0.0122
# 12 12 0.00955
# 13 13 0.625
# 14 14 0.680
# 15 15 0.139
Among all 15 candidate splits in this list, the 2nd has the smallest OSE, approximately 0.00868. We therefore choose this as the best split, at index 2: the left bucket contains only the domain edu, and the right bucket contains the domains com, net, org and gov.
size.kb would the split be made? (Using a minimum split value of 10.)

# optimal_ose optimal_ose_v
# [1,] 0 113
Here, after calling the function we defined, we find that the split at size.kb = 113 gives the smallest OSE among all candidate splits under minsplit = 10. The reported optimal_ose of 0 may simply have its decimals not displayed, so we conclude that the optimal split on size.kb is made at 113.
Below are the columns names of the spam data.
# [1] "isuid" "id" "day of week" "time of day" "size.kb"
# [6] "box" "domain" "local" "digits" "name"
# [11] "cappct" "special" "credit" "sucker" "porn"
# [16] "chain" "username" "large text" "category" "spam"
We use the ose function to calculate the OSE for the variable “time of day”:
# optimal_ose optimal_ose_v
# [1,] 0.2776457 7
We use the function to calculate the OSE for the variable “special”:
# optimal_ose optimal_ose_v
# [1,] 0.2638581 0
Here, we use the function to calculate the OSE of variable “digits”
# optimal_ose optimal_ose_v
# [1,] 0.2657224 0
Here, we use the function to calculate the OSE of variable “cappct”
# optimal_ose optimal_ose_v
# [1,] 0.273913 0
From the calculations above, applying the function we defined, we conclude that the best split is generated by the variable special, with an optimal OSE of 0.2638581. Since no decimals are kept and special takes only integer values, the optimal split for special occurs at special = 0, where the OSE is 0.2638581.
The Gini measure for a two-class node with proportion $p$ in class 1 is
$$G(p) = 2p(1-p).$$
For the Gini index, a lower value means a better split: from the graph we can see that $p = 0.5$ is the worst case, while as $p \to 0$ or $p \to 1$ the Gini index performs better.

For OSE (the function defined in the question above), a lower value likewise means a better split. When the split places the priority class almost purely into one bucket, the OSE is very small, which is the optimal split; when the classes remain mixed across the buckets, the OSE is high, which is a worse split.
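The Gini behaviour described above can be checked numerically; a minimal Python sketch of the two-class Gini curve:

```python
# Two-class Gini impurity g(p) = 2*p*(1-p): maximal (worst) at p = 0.5,
# zero for a pure node (p = 0 or p = 1).
def gini(p):
    return 2 * p * (1 - p)

print(gini(0.5))   # 0.5, the worst case
print(gini(0.0))   # 0.0, a pure node
print(gini(0.25))  # 0.375, better than the 50/50 case
```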
Principal component analysis is often used to create indicator variables (see e.g. Constructing socio-economic status indices: how to use principal components analysis). In this question, you will look at the socioeconomic data provided on kaggle to create an indicator variable, using PCA.
# spec_tbl_df[,10] [167 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
# $ country : chr [1:167] "Afghanistan" "Albania" "Algeria" "Angola" ...
# $ child_mort: num [1:167] 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
# $ exports : num [1:167] 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
# $ health : num [1:167] 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
# $ imports : num [1:167] 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
# $ income : num [1:167] 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
# $ inflation : num [1:167] 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
# $ life_expec: num [1:167] 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
# $ total_fer : num [1:167] 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
# $ gdpp : num [1:167] 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
# - attr(*, "spec")=
# .. cols(
# .. country = col_character(),
# .. child_mort = col_double(),
# .. exports = col_double(),
# .. health = col_double(),
# .. imports = col_double(),
# .. income = col_double(),
# .. inflation = col_double(),
# .. life_expec = col_double(),
# .. total_fer = col_double(),
# .. gdpp = col_double()
# .. )
Since there are 9 numeric variables in total, 9 PCs need to be calculated.
# Standard deviations (1, .., p=9):
# [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
# [8] 0.2971790 0.2586020
#
# Rotation (n x k) = (9 x 9):
# PC1 PC2 PC3 PC4 PC5
# child_mort -0.4195194 -0.192883937 0.02954353 -0.370653262 0.16896968
# exports 0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
# health 0.1508378 0.243086779 0.59663237 -0.461897497 -0.51800037
# imports 0.1614824 -0.671820644 0.29992674 0.071907461 -0.25537642
# income 0.3984411 -0.022535530 -0.30154750 -0.392159039 0.24714960
# inflation -0.1931729 0.008404473 -0.64251951 -0.150441762 -0.71486910
# life_expec 0.4258394 0.222706743 -0.11391854 0.203797235 -0.10821980
# total_fer -0.4037290 -0.155233106 -0.01954925 -0.378303645 0.13526221
# gdpp 0.3926448 0.046022396 -0.12297749 -0.531994575 0.18016662
# PC6 PC7 PC8 PC9
# child_mort -0.200628153 0.07948854 0.68274306 0.32754180
# exports 0.059332832 0.70730269 0.01419742 -0.12308207
# health -0.007276456 0.24983051 -0.07249683 0.11308797
# imports 0.030031537 -0.59218953 0.02894642 0.09903717
# income -0.160346990 -0.09556237 -0.35262369 0.61298247
# inflation -0.066285372 -0.10463252 0.01153775 -0.02523614
# life_expec 0.601126516 -0.01848639 0.50466425 0.29403981
# total_fer 0.750688748 -0.02882643 -0.29335267 -0.02633585
# gdpp -0.016778761 -0.24299776 0.24969636 -0.62564572
According to the output above, all 9 PCs can be calculated from this data.
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|---|
| Variance | 4.1357 | 1.5463 | 1.1704 | 0.9948 | 0.6606 | 0.2236 | 0.1134 | 0.0883 | 0.0669 |
| Proportion | 0.4595 | 0.1718 | 0.1300 | 0.1105 | 0.0734 | 0.0248 | 0.0126 | 0.0098 | 0.0074 |
| Cum. prop | 0.4595 | 0.6313 | 0.7614 | 0.8719 | 0.9453 | 0.9702 | 0.9828 | 0.9926 | 1.0000 |
The first principal component of a set of variables $x_1, \dots, x_p$ is the normalized linear combination
$$z_1 = \phi_{11}x_1 + \phi_{21}x_2 + \cdots + \phi_{p1}x_p, \qquad \sum_{j=1}^{p}\phi_{j1}^2 = 1,$$
that has the largest variance.
We can get the following summary as
# Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6 PC7
# Standard deviation 2.0336 1.2435 1.0818 0.9974 0.8128 0.47284 0.3368
# Proportion of Variance 0.4595 0.1718 0.1300 0.1105 0.0734 0.02484 0.0126
# Cumulative Proportion 0.4595 0.6313 0.7614 0.8719 0.9453 0.97015 0.9828
# PC8 PC9
# Standard deviation 0.29718 0.25860
# Proportion of Variance 0.00981 0.00743
# Cumulative Proportion 0.99257 1.00000
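The proportions in the summary above follow directly from the standard deviations: the variance of each PC divided by the total variance. A Python sketch (not the report's R code) using the standard deviations from the prcomp output:

```python
# Proportion of variance explained by each PC: sd_i^2 / sum(sd^2).
# With 9 standardized variables the variances sum to 9.
sds = [2.0336314, 1.2435217, 1.0818425, 0.9973889, 0.8127847,
       0.4728437, 0.3368067, 0.2971790, 0.2586020]

variances = [s * s for s in sds]
props = [v / sum(variances) for v in variances]

print(round(sum(variances), 4))  # 9.0
print(round(props[0], 4))        # 0.4595
```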
According to the output, the proportion of variance explained by the first PC is 0.4595; that is, 45.95% of the total variance is explained by PC1.
The loadings of PC1 are shown below. Based on the plot of loadings, there is a lot of variability in the loading coefficients, so a bootstrap is required to generate more precise confidence intervals and identify the relevant variables.
According to the graph, on PC1 the variables child_mort, income, life_expec, total_fer and gdpp are significantly different from 0: their confidence intervals lie entirely away from the 0 boundary. On the other hand, the confidence intervals for exports, health, imports and inflation cross the 0 boundary, so these variables are less significant and can be dropped when fitting a new model.
# # A tibble: 9 x 9
# PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.426 0.223 -0.114 0.204 -0.108 0.601 -0.0185 0.505 0.294
# 2 0.398 -0.0225 -0.302 -0.392 0.247 -0.160 -0.0956 -0.353 0.613
# 3 0.393 0.0460 -0.123 -0.532 0.180 -0.0168 -0.243 0.250 -0.626
# 4 0.284 -0.613 -0.145 -0.00309 -0.0576 0.0593 0.707 0.0142 -0.123
# 5 0.161 -0.672 0.300 0.0719 -0.255 0.0300 -0.592 0.0289 0.0990
# 6 0.151 0.243 0.597 -0.462 -0.518 -0.00728 0.250 -0.0725 0.113
# 7 -0.193 0.00840 -0.643 -0.150 -0.715 -0.0663 -0.105 0.0115 -0.0252
# 8 -0.404 -0.155 -0.0195 -0.378 0.135 0.751 -0.0288 -0.293 -0.0263
# 9 -0.420 -0.193 0.0295 -0.371 0.169 -0.201 0.0795 0.683 0.328
After selecting the useful variables, we construct a new PCA (newpca) consisting only of those variables:
# Standard deviations (1, .., p=5):
# [1] 1.9065314 0.9721904 0.4821562 0.3239670 0.2873232
#
# Rotation (n x k) = (5 x 5):
# PC1 PC2 PC3 PC4 PC5
# child_mort -0.4654859 -0.4031338 1.944752e-01 -0.2900511 0.7062972
# income 0.4297577 -0.5358646 1.499573e-01 0.6779796 0.2145085
# life_expec 0.4790410 0.2223537 -6.274361e-01 -0.1626967 0.5485730
# total_fer -0.4420473 -0.4036622 -7.389291e-01 0.1993868 -0.2363890
# gdpp 0.4168274 -0.5813329 1.490296e-05 -0.6244907 -0.3135575
# Importance of components:
# PC1 PC2 PC3 PC4 PC5
# Standard deviation 1.907 0.9722 0.48216 0.32397 0.28732
# Proportion of Variance 0.727 0.1890 0.04649 0.02099 0.01651
# Cumulative Proportion 0.727 0.9160 0.96250 0.98349 1.00000
The new formula of PC1 will be
$$\text{PC1} = -0.465\,\text{child\_mort} + 0.430\,\text{income} + 0.479\,\text{life\_expec} - 0.442\,\text{total\_fer} + 0.417\,\text{gdpp}.$$

**The reason we choose PC1 as our indicator variable is, first of all, that PC1 on its own has a cumulative proportion of variance of 72.7%.** After selecting the significant variables, we have made a pca4 that consists only of these variables.
According to the biplot, Luxembourg, Singapore, Qatar and Malta are among the countries with the highest values of PC1, while Nigeria, Haiti, Central African Republic and Chad are 4 countries with the smallest PC1. Based on macroeconomic reasoning, developed countries tend to have high PC1, while developing countries with low development status tend to have low PC1. Looking at the coloured plot, high values of imports, exports, income, health, life_expec and gdpp have a positive impact on PC1, whereas child_mort, inflation and total_fer have a negative impact on PC1.

All in all, based on the two plots, a country with a high PC1 has high values of imports, exports, income, health, life_expec and gdpp, low values of child_mort, inflation and total_fer, and tends to be more developed. A country with a low PC1 shows the reverse pattern and tends to be a less developed country.
e1071 David Meyer, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel and Friedrich Leisch (2021). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-6. https://CRAN.R-project.org/package=e1071
tidyverse Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
GGally Barret Schloerke, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg and Jason Crowley (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.1. https://CRAN.R-project.org/package=GGally
MASS Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
discrim
Max Kuhn (2020). discrim: Model Wrappers for Discriminant Analysis. R package version 0.1.1. https://CRAN.R-project.org/package=discrim
tidymodels
Kuhn et al., (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
kableExtra Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra
rpart
Terry Therneau and Beth Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart
palmerpenguins
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/
plotly
C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020.
tidyr
Hadley Wickham (2021). tidyr: Tidy Messy Data. R package version 1.1.3. https://CRAN.R-project.org/package=tidyr
tourr
Hadley Wickham, Dianne Cook, Heike Hofmann, Andreas Buja (2011). tourr: An R Package for Exploring Multivariate Data with Projections. Journal of Statistical Software, 40(2), 1-18. URL http://www.jstatsoft.org/v40/i02/.
rsample Julia Silge, Fanny Chow, Max Kuhn and Hadley Wickham (2021). rsample: General Resampling Infrastructure. R package version 0.0.9. https://CRAN.R-project.org/package=rsample
parsnip
parsnip
Max Kuhn and Davis Vaughan (2021). parsnip: A Common API to Modeling and Analysis Functions. R package version 0.1.5. https://CRAN.R-project.org/package=parsnip
yardstick
Max Kuhn and Davis Vaughan (2021). yardstick: Tidy Characterizations of Model Performance. R package version 0.0.8. https://CRAN.R-project.org/package=yardstick
spinifex
Nicholas Spyrison and Dianne Cook (2021). spinifex: Manual Tours, Manual Control of Dynamic Projections of Numeric Multivariate Data. R package version 0.2.8. https://CRAN.R-project.org/package=spinifex
dplyr
Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https://CRAN.R-project.org/package=dplyr
skimr
Elin Waring, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu and Shannon Ellis (2021). skimr: Compact and Flexible Summaries of Data. R package version 2.1.3. https://CRAN.R-project.org/package=skimr
caret
Max Kuhn (2020). caret: Classification and Regression Training. R package version 6.0-86. https://CRAN.R-project.org/package=caret
ggplot2
H. Wickham (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
klaR
Weihs, C., Ligges, U., Luebke, K. and Raabe, N. (2005). klaR Analyzing German Business Cycles. In Baier, D., Decker, R. and Schmidt-Thieme, L. (eds.), Data Analysis and Decision Support, 335-343, Springer-Verlag, Berlin.
knitr
Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.33.
Yihui Xie (2015). Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
boot
Angelo Canty and Brian Ripley (2021). boot: Bootstrap R (S-Plus) Functions. R package version 1.3-27.
Davison, A. C. & Hinkley, D. V. (1997). Bootstrap Methods and Their Applications. Cambridge University Press, Cambridge. ISBN 0-521-57391-2
ggrepel
Kamil Slowikowski (2021). ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'. R package version 0.9.1. https://CRAN.R-project.org/package=ggrepel
tidyverse
Wickham, H. et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
tidymodels
Max Kuhn and Hadley Wickham (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
discrim
Max Kuhn (2020). discrim: Model Wrappers for Discriminant Analysis. R package version 0.1.1. https://CRAN.R-project.org/package=discrim
MASS
Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
kableExtra
Hao Zhu (2021). kableExtra: Construct Complex Table with 'kable' and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra
tourr
Hadley Wickham, Dianne Cook, Heike Hofmann and Andreas Buja (2011). tourr: An R Package for Exploring Multivariate Data with Projections. Journal of Statistical Software, 40(2), 1-18. http://www.jstatsoft.org/v40/i02/
magrittr
Stefan Milton Bache and Hadley Wickham (2020). magrittr: A Forward-Pipe Operator for R. R package version 2.0.1. https://CRAN.R-project.org/package=magrittr
rpart
Terry Therneau and Beth Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart
rpart.plot
Stephen Milborrow (2020). rpart.plot: Plot 'rpart' Models: An Enhanced Version of 'plot.rpart'. R package version 3.0.9. https://CRAN.R-project.org/package=rpart.plot
rsample
Julia Silge, Fanny Chow, Max Kuhn and Hadley Wickham (2021). rsample: General Resampling Infrastructure. R package version 0.0.9. https://CRAN.R-project.org/package=rsample
plotly
C. Sievert (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC, Florida. ISBN 9781138331457
GGally
Barret Schloerke, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg and Jason Crowley (2021). GGally: Extension to 'ggplot2'. R package version 2.1.1. https://CRAN.R-project.org/package=GGally
pracma
Hans W. Borchers (2021). pracma: Practical Numerical Math Functions. R package version 2.3.3. https://CRAN.R-project.org/package=pracma
car
John Fox and Sanford Weisberg (2019). An R Companion to Applied Regression, Third Edition. Thousand Oaks, CA: Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/
rattle
Williams, G. J. (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Use R!, Springer.